Bootstrapping

Bootstrapping By Hand

The process of bootstrapping makes use of resampling with replacement. Thankfully, R has a sample function that returns from a vector or list, a sample of a given size taken with or without replacement. We make use of sample to create a function to bootstrap the median of a vector of values.

Create Functions

  b.median = function(data, num){
    ### Generate num resample from data --- one for each i = 1 to num
    resamples <- lapply(1:num, function(i){sample(data, replace=TRUE)})
    
    ## Get the median of each of the resamples
    r.median <- sapply(resamples, median)
    
    ### Compute the standard error of the median from the num medians
    std.err <- sqrt(var(r.median))
    
    ### Return a list containing the std error , the resamples, and their medians
    list(std.err = std.err, reamples = resamples, medians = r.median)
  }

Having created a function to bootstrap the median, it is easy to create a function to bootstrap the mean.

  ### Create a copy of the b.median function called b.mean
  b.mean <- b.median

We now edit the function by entering into the console one of

  ### Edit and replace
  fix(b.mean)

  ### Edit and assign
  b.mean <- edit(b.mean)

Note that we could have used edit to take care of both operations at the same time.

  b.mean <- edit(b.median)

This approach is fast but prone to overwriting the “good” version.

The b.mean function should be modified by replacing occurences of “median” with “mean”. This results in

  b.mean <- function(data, num){
  ### Generate num resample from data --- one for each i = 1 to num
  resamples <- lapply(1:num, function(i){sample(data, replace=TRUE)})
    
  ## Get the median of each of the resamples
  r.mean <- sapply(resamples, mean)
    
  ### Compute the standard error of the median from the num medians
  std.err <- sqrt(var(r.mean))
    
  ### Return a list containing the std error , the resamples, and their medians
  list(std.err = std.err, reamples = resamples, means = r.mean)
  }

Run the Functions

  ### Create 100 observations from an exponential with mean 10
  data <- round(rexp(100, 1/10))
  hist(data)

  qqnorm(data)
  qqline(data)

  qqplot(data, rexp(500, 1/mean(data)))
  qqline(data, distribution = function(p) qexp(p, 1/mean(data)),
       prob = c(0.1, 0.6), col = 2)

  ### Check the first 10 observations
  cat("First 10 obs of original sample: ", data[1:10], "\n")

## First 10 obs of original sample:  12 10 2 6 4 9 3 19 1 2

  ### Compute 10000 bootstrap samples for the mean
  data.b.mean <- b.mean(data, 10000)
  
  ### Check the standard error
  cat("se(xbar) = ", data.b.mean$std.err, "\n")

## se(xbar) =  0.9225718

  ### Look at the bootstrap distribution
  hist(data.b.mean$means)

  qqnorm(data.b.mean$means)
  qqline(data.b.mean$means)

  ### Compute 10000 bootstrap samples for the median
  data.b.median <- b.median(data, 10000)
  
  ### Check the standard error
  cat("se(m) = ", data.b.median$std.err, "\n")

## se(m) =  0.8487821

  ### Look at the bootstrap distribution
  hist(data.b.median$medians)

  qqnorm(data.b.median$medians)
  qqline(data.b.median$medians)

If we use the same “data” as before and repeat the bootstrapping, we get a different results because of different samples.

The boot Package

R has a boot package that makes it easy to implement bootstrapping on “complex” statistics.

Mean

We can use boot to replicate the bootstrapping that was carried out above.

  p_load(boot)
  ### The boot.mean takes the original data (d) and computes the mean of the
  ### selected, indexed (i) values (d[i])
  boot.mean <- function(d, i){mean(d[i])}
  data.boot.mean <- boot(data, boot.mean, R=10000)
  summary(data.boot.mean)

##           Length Class  Mode     
## t0            1  -none- numeric  
## t         10000  -none- numeric  
## R             1  -none- numeric  
## data        100  -none- numeric  
## seed        626  -none- numeric  
## statistic     1  -none- function 
## sim           1  -none- character
## call          4  -none- call     
## stype         1  -none- character
## strata      100  -none- numeric  
## weights     100  -none- numeric

  data.boot.mean

## 
## ORDINARY NONPARAMETRIC BOOTSTRAP
## 
## 
## Call:
## boot(data = data, statistic = boot.mean, R = 10000)
## 
## 
## Bootstrap Statistics :
##     original    bias    std. error
## t1*     9.62 -0.005246   0.9278368

  hist(data.boot.mean$t)

  qqnorm(data.boot.mean$t)
  qqline(data.boot.mean$t)

  boot.ci(data.boot.mean)

## Warning in boot.ci(data.boot.mean): bootstrap variances needed for studentized
## intervals

## BOOTSTRAP CONFIDENCE INTERVAL CALCULATIONS
## Based on 10000 bootstrap replicates
## 
## CALL : 
## boot.ci(boot.out = data.boot.mean)
## 
## Intervals : 
## Level      Normal              Basic         
## 95%   ( 7.807, 11.444 )   ( 7.740, 11.380 )  
## 
## Level     Percentile            BCa          
## 95%   ( 7.86, 11.50 )   ( 7.98, 11.68 )  
## Calculations and Intervals on Original Scale

Correlation

Computing a confidence interval for the sample correlation can be challenging. Bootstrapping makes this relatively easy.

  htwt <- read.csv("HtWt.csv")
  head(htwt)

##   Height Weight Group
## 1     64    159     1
## 2     63    155     2
## 3     67    157     2
## 4     60    125     1
## 5     52    103     2
## 6     58    122     2

  ### Create a function to compute the correlation of reseampled data
  f <- function(d, i){corr(d[i,])}
  boot.htwt.corr <- boot(htwt[,1:2], f, R=10000)
  
  boot.htwt.corr

## 
## ORDINARY NONPARAMETRIC BOOTSTRAP
## 
## 
## Call:
## boot(data = htwt[, 1:2], statistic = f, R = 10000)
## 
## 
## Bootstrap Statistics :
##      original        bias    std. error
## t1* 0.9868154 -0.0002478202 0.005661326

  qqnorm(boot.htwt.corr$t)
  qqline(boot.htwt.corr$t)

  boot.ci(boot.htwt.corr)

## Warning in boot.ci(boot.htwt.corr): bootstrap variances needed for studentized
## intervals

## BOOTSTRAP CONFIDENCE INTERVAL CALCULATIONS
## Based on 10000 bootstrap replicates
## 
## CALL : 
## boot.ci(boot.out = boot.htwt.corr)
## 
## Intervals : 
## Level      Normal              Basic         
## 95%   ( 0.9760,  0.9982 )   ( 0.9788,  1.0003 )  
## 
## Level     Percentile            BCa          
## 95%   ( 0.9734,  0.9948 )   ( 0.9702,  0.9941 )  
## Calculations and Intervals on Original Scale

Slope and Intercept

We can bootstrap the slope and intercept of a simple regression model.

  htwt <- read.csv("http://facweb1.redlands.edu/fac/jim_bentley/data/math 312/bootstrap/htwt.csv")
  head(htwt)

##   Height Weight Group
## 1     64    159     1
## 2     63    155     2
## 3     67    157     2
## 4     60    125     1
## 5     52    103     2
## 6     58    122     2

  ### Create a function to compute the correlation of reseampled data
  f <- function(d, i){coef(lm(d$Weight[i] ~ d$Height[i]))}
  boot.htwt.coef <- boot(htwt[,1:2], f, R=10000)
  
  boot.htwt.coef

## 
## ORDINARY NONPARAMETRIC BOOTSTRAP
## 
## 
## Call:
## boot(data = htwt[, 1:2], statistic = f, R = 10000)
## 
## 
## Bootstrap Statistics :
##        original       bias    std. error
## t1* -173.459595 -0.588689709  12.6002354
## t2*    5.041217  0.009350654   0.2007861

  qqnorm(boot.htwt.coef$t[,1])
  qqline(boot.htwt.coef$t[,1])

  qqnorm(boot.htwt.coef$t[,2])
  qqline(boot.htwt.coef$t[,2])